Fingerprint-based Similarity Search and its Applications
Abstract
This paper introduces a new technology and tools from the field of text-based information retrieval. The authors have developed (1) a fingerprint-based method for highly efficient near similarity search and (2) an application of this method that identifies plagiarized passages in large document collections.

The contribution of our work is twofold. First, it provides a search technology that enables a new quality of comparative analysis for large and complex scientific texts. Second, this technology gives rise to a new class of tools for plagiarism analysis, since the comparison of entire books becomes computationally feasible.

The paper is organized as follows. Section 1 introduces plagiarism offenses and related detection methods. Section 2 outlines the method of fuzzy-fingerprints as a means for near similarity search. Section 3 shows our methods in action: it gives examples of near similarity search and plagiarism detection and discusses results from a comprehensive performance analysis.

1 Plagiarism Analysis

Plagiarism is the act of claiming to be the author of material that someone else actually wrote (Encyclopædia Britannica 2005), and, with the ubiquitousness
Similar resources
Maximum Common Substructure-Based Data Fusion in Similarity Searching
Data fusion has been shown to work very well when applied to fingerprint-based similarity searching, yet little is known about its application to maximum common substructure (MCS)-based similarity searching. This work focuses on two similarity search applications of the MCS. Typically, the number of bonds in the MCS, as well as the bonds in the two molecules being compared, are used in a simila...
Statistical modeling of value distributions of similarity coefficients in virtual screening and its application to predicting fingerprint search performance
Similarity searching using fingerprints is a popular ligand-based virtual screening approach. The Tanimoto coefficient (Tc) is the most widely used measure for quantifying fingerprint similarity. In general, it is very difficult to assess the significance of the similarity of two molecules based solely on their calculated Tc values. In the literature, Tc cut-off values are frequently intuitively...
Target enhanced 2D similarity search by using explicit biological activity annotations and profiles
Background: The enriched biological activity information of compounds in large and freely accessible chemical databases like the PubChem Bioassay Database has become a powerful resource for the scientific research community. Currently, 2D fingerprint-based conventional similarity search (CSS) is the most widely used approach for database screening, but it does not typically incor...
Fingerprint Indexing and Verification
This paper presents fingerprint indexing based on graph information of minutiae, as well as fingerprint classification and verification based on a hierarchical agglomerative clustering technique. The proposed fingerprint indexing is invariant under translation and rotation. Its performance is evaluated on several real-life datasets. The fingerprint database is clustered into five classes based on t...
Fuzzy-Fingerprints for Text-Based Information Retrieval
This paper introduces a particular form of fuzzy-fingerprints: their construction, their interpretation, and their use in the field of information retrieval. Though the concept of fingerprinting in general is not new, the way of using them within a similarity search as described here is: Instead of computing the similarity between two fingerprints in order to assess the similarity between the as...